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Abstract — Online Social Networks (OSN) are among the most 
popular applications in today's Internet. Decentralized online 
social networks (DOSNs), a special class of OSNs, promise 
better privacy and autonomy than traditional centralized OSNs. 
However, ensuring availability of content when the content owner 
is not online remains a major challenge. 

In this paper, we rely on the structure of the social graphs 
underlying DOSN for replication. In particular, we propose that 
friends, who are anyhow interested in the content, are used to 
replicate the users content. We study the availability of such 
natural replication schemes via both theoretical analysis as well 
as simulations based on data from OSN users. We find that the 
availability of the content increases drastically when compared 
to the online time of the user, e. g., by a factor of more than 2 for 
90% of the users. Thus, with these simple schemes we provide a 
baseline for any more complicated content replication scheme. 

L Introduction 

Online social networks (OSNs) successfully claimed their 
place among the most popular Internet services. Despite their 
success, OSNs controlled by a single entity raise issues in 
terms of privacy of content and communication. To address 
these issues, recent works |[T|-|[3l have proposed decentral- 
ized online social networks (DOSNs), providing privacy and 
autonomy. 

Privacy of user content involves two different aspects: the 
access to the data and the storage/replication of the data. 
Access to the data can be restricted through encryption ||2l, 
without requiring trust between the owner of the data and 
the intermediaries who store it. Where to store the data in 
decentralized systems is generally solved by having an external 
storage system in charge of keeping replicas of the data, for 
example through a distributed file system JU, Q. 

The main challenge in decentralization comes from guaran- 
teeing availability of the data when the owner of the data is not 
onUne Q. Availability has been studied in P2P file-sharing Ij}, 
ISl and distributed file systems H, ISj. File sharing is driven 
by popularity of content instead of social relations. Most 
P2P and distributed file systems introduce significant overhead 
when replicating data to achieve high availability, without 
sacrificing high scalability. 

Almost all existing DOSN approaches rely on external 
storage services and therefore do not study content availability. 
Those DOSNs which do not rely on external storage amount 
to exactly one, i.e. PeerSoN ||9l which proposes to use storage 



provided by a user's friends. In this paper, we study availability 
in such DOSNs that do not rely on external storage and rather 
exploit the social graph of a user As direct social friends are 
interested in a user's content, they will store it as an effect of 
their interest. Contrary to existing data replication schemes that 
incur computation and communication overhead, our implicit 
content replication scheme avoids this overhead. 

Given the lack of explicit replication of our strategy, one 
may expect that the resulting availability of a user's data is 
very limited. We show in this paper that despite the limited 
replication provided by our strategy, the availability of content 
is relatively high. Furthermore, we show that by allowing 
a limited fraction of the users to be always online, e.g., 
by utilizing their own home gateway or an external storage 
service, the content availability is comparable to existing 
OSNs. 

This paper makes the following contributions: 

• Using network traces, we study and model the connection 
patterns of today's OSN users. 

• Based on these results, we simulate and analyze the 
performance of our replication scheme. Results show that 
friend-only replication allows a surprisingly good content 
availability. 

• We study the impact of users being always online and 
show that already a small fraction leads to high overall 
content availability. 

The remainder of this paper is structured as follows. In 
Section [n] we present our notion of DOSNs, and introduce our 
replication schemes and metrics. In Section [nil we provide 
an analytical model of the availability of our schemes. We 
present our simulation approach and our models for session 
characteristics in Section|IV] before we discuss our simulation 
results in Section [V] We discuss related work in Section |Vl] 
and conclude in Section IVIII 



II. Our DOSN concept and evaluation strategy 

We first describe our concept of a Distributed Online Social 
Network (DOSN) and define some terminology. Then we 
present the content replication schemes we evaluate in this 
paper and describe our availability metrics. 



A. Terminology and System Description 

DOSN users can engage in social relations with other users, 
e.g., online friendship, follow another's activity, or subscribe 
to status updates. Independent of the type of social relation that 
connects two users, we call them friends and their relationship 
friendship. The entirety of users and their relationships form 
the social graph of the DOSN, where users and relations 
correspond to nodes and links, respectively. A user's data — 
the user's content — can be seen as the digital representation 
of a user, which is stored on a computing device and can 
be transmitted from one device to another. The content can 
contain information on location, work, education, interests, 
photos or status updates. Each user regularly produces such 
content, that she is eager to share with her friends with the 
help of the DOSN. We do not make any assumption on the 
type of data exchanged by users, but in the rest of the paper, 
we assume that the time to transfer a user's data is negligible. 
Typically content in OSNs is small in size. Moreover, large 
objects such as videos can be uploaded into the cloud (e.g. 
YouTube) and only the link to this object is shared through 
the DOSN. Thus, as we consider DOSN and not P2P file- 
sharing systems, this assumption is reasonable. 

We consider DOSNs in general, without limiting ourselves 
to a specific implementation. Yet, we focus on cases where 
there is no central server storing users' data, e. g., PeerSon 191 . 
This implies that availability is a function of users being online 
and able to serve the data of a given user 

In this paper, we concentrate on DOSN data replication 
mechanisms. Therefore, we assume that the typical functions 
of OSNs, such as finding online friends and bonding to them, 
or creating interest groups, is ensured by another component 
of the DOSN. An often discussed, typical problem in P2P sys- 
tems is the so called boot strapping problem, which describes 
the process and the related issues when a new node wants 
to join the system. We will not discuss the boot strapping 
problem in this context because it increases the complexity 
a lot while giving very little insight. Note however that our 
simulation does include nodes with only a few friends. Further 
details on boot strapping a DOSN are discussed in ||9l- 

OSNs are highly dynamic, users join and leave, friendships 
are created and destroyed. This leads to a very complex 
scenario. To reduce this complexity we examine a static 
snapshot of an OSN. We do not assume that the graph is 
static itself but look at it in a certain state. We believe this 
simplification to be reasonable as long as the simulation time 
frame is limited. Throughout our simulations we do not allow 
users to join or leave, or edges to be added or removed. We 
further assume that for the time of the simulation the data 
does not change and because of that stays valid. In this study 
we want to follow a piece of data and its distribution over 
the system and the resulting availability. If the data would be 
changed, e.g. by adding new info, this would be equal to a 
restart of one of our simulation runs. 

We consider a one-to-one mapping between a user and a 
node, and that the node corresponding to a user is always 



a replica of this user's data. In other words, nodes always 
hold a complete copy of their user's data. We further assume 
that each user uses exactly one device. We do not consider 
the case of one user using multiple devices (such as a smart- 
phone and a desktop computer), nor the case of multiple users 
sharing the same device (e. g., the family's desktop computer). 
The assumption that each device holds a full copy is valid, 
because memory is cheap and even smart phones have a 
lot of storage capacity these days. A full system to handle 
different versions of data is given in ID. This system uses 
a DHT to store a) information about available versions of 
the content and b) locations (i. e., which node) where each 
version can be found. This way a user is not necessarily tied 
to a single device. Thus, if a user utilizes multiple devices to 
access the DOSN and some device does not hold all content 
or the most recent version of the content, the DOSN knows 
about it. That way friends can make an informed decision to 
either download outdated content or wait for fresh content. Yet, 
this case might influence a user's interaction with the DOSN 
and we can neither predict nor measure the effects of such 
a feature. Therefore we exclude this case from our analysis. 
We do not consider shared devices for two reasons. Either 
they only share data when the user is online, then there is not 
differences to our "exactly-one-device"-rule. Another option 
is that the device shares the data of all its users while any 
of the users is online. For that case our assumptions actually 
constitute a worst case. 

B. Replication schemes 

To improve data availability, data can be replicated on 
multiple nodes. These nodes become replicas. Choosing good 
replicas is a crucial question that has received a lot of attention 
in the context of distributed file sharing. In the context of 
DOSN however, one structure is by definition available: the 
underlying social graph. In this paper, we build an implicit 
replication strategy from this underlying social graph. The 
rationale behind this strategy is the fact that social friends 
are natural candidates for replicating a user's data, as they are 
interested in that person. And since memory is cheap, friends 
can store a replica of an item even if they are not interested in 
it. We also assume that each user takes care to backup her own 
data. In case a friend's hardware fails, the user still has the 
full data set and can transmit it to her friend as soon as their 
hardware works again. We study three different replication 
mechanisms that constitute different ways to exploit the social 
graph of a DOSN for data replication. The example in Figure[T| 
illustrates these replication schemes for user Alice: 

No replication: In this scheme, only the user provides 
her own data. In other words, the data is available 
only when the corresponding user is online. It is 
arguably not a replication mechanism, but constitutes 
a baseline for comparison of the other schemes. In 
Figure [T] Alice's availability is equal to Alice's 
online time. This is the only case where it could be 
a difference if a user has multiple devices. If Alice 
is online with her mobile phone and adds a new data 
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Fig. 1. Our replication strategy for user Alice. Bob and Charlie are 
friends of Alice. Once two friends (e. g., Alice and Charlie, first arrow) are 
simultaneously online they exchange their data: Charlie gets Alice's data. In 
the replication scheme 1, Bob never gets Alice's data since they are never 
online together. In replication scheme 2, Bob downloads Alice's data from 
Charlie (second arrow), which leads to a better availability. 

item, this item will not be available in that moment 
from her home PC. On the other side, in that moment 
the home PC would probably not be online. In case it 
is, synchronization mechanisms like those explained 
in ||9l will take care that both devices reach the same 
state again. 

1 Direct replication only: In this scheme, the data is 
made available by the user and her friends. To be able 
to distribute user's data, friends must obtain a copy 
of this data directly from the user itself. In Figure [T] 
Alice's 1 availability is her online time, plus the 
online time of Charlie after he got a copy of Alice's 
data. Since Bob and Alice were never simultaneously 
online. Bob cannot replicate Alice's data. 

2 Indirect replication: In this scheme, friends of a 
user can collaborate with other friends to obtain this 
user's data when they are online. In Figure [T] Bob 
can in this case get Alice's data from Charlie, and 
distribute it. Alice's 2 availability is thus made by 
Alice, Charlie after he got the copy from Alice, and 
Bob after he got the copy from Charlie. 

Note that if we represent the social network by its adjacency 
matrix A, the links of A" = W,A' =A, and A^ correspond to 
replication graphs of 0, 1, and 2 strategies respectively, where 
an edge in the replication graph represents data exchange. It is 
however important to recall that data is only hosted by direct 
friends: common friends of a user might exchange the data 
of this common friend, but do not host each other's data. For 
instance in Figure [T] if Bob is not Charlies' friend, he will 
get Alice's data from Charlie, but will not host Charlie's data. 
Note also that to keep this "only direct friends are replicas 
rule", strategy 2 is the best we can do. 

C. Metrics: Pure availability and Friend Availability 

The term availability can have many definitions in the 
context of P2P and distributed systems. Definitions include 



hardware failure rate, churn rate of peers, or information about 
which parts of content can be downloaded from which peer In 
this study we consider content availability, i. e., which fraction 
of time the data of a user is available to others in a DOSN. 
The content is available when the user's node is online, or 
when a replica of the data is online. 

One option to measure the availability would be to count 
how many data requests are successfully answered in a real 
system. This would require to model and simulate user re- 
quests. Although this is possible, we refrain from doing so 
because it substantially increases the simulation complexity 
and duration. Moreover, it is difficult to find real-world data 
that includes data request behavior and derive a good model 
from it. Therefore, we rely on time-based availability metrics. 

We do not consider data updates in this study. Instead we 
assume that each user generates new content at some point 
of the simulation. We measure availability over a day (24 
hours) following the new content generation. Note, that this 
also means that the only copy of the data available in the 
system at the beginning is held by the user's node. We believe 
this constitutes a worst-case scenario. Consider Figure [T] The 
first time Bob is online he still has not downloaded Alice's 
data. We therefore consider Alice's data as unavailable. In 
reality. Bob has probably a local copy of Alice's older data 
that he can serve as well: he just does not have the latest 
update. 

Here, we do consider two different types of availability. The 
first one is the traditional content availability, as defined above: 
The fraction of time a piece of data is available. We refer to 
it as pure availability or metric Ml. 

In the context of DOSN however, we can again exploit 
the social graph. People interested in a given user's data are 
primarily her friends. Recall that any kind of OSN relationship 
defines friendship. In a news service, only the followers that 
subscribed to a feed will receive the content. In Facebook or 
Google Plus many profiles are only shared with friends, such 
that an arbitrary user cannot see the profile. 

Thus, in this paper, we also measure the friend availability 
or metric M2, which is the fraction of time a user's data is 
available when her friends are online. This availability takes 
into account who is interested in the data that is available. 
This last availability can be seen as a convergence metric 
as once all the friends of a user have a copy of the data, 
this friend availability is one, even though friends are rarely 
online. Note also that pure availability can be higher than 
friend availability. For instance assume Figure [T] represents 
one day and Bob is Alice's only friend. Because Alice and 
Bob are never simultaneously online, M2 would be zero but 
Ml corresponds to Alice's availability. 

In the remainder of this paper we refer to the combination of 
the replication scheme x and the metric My as xMy, e. g., the 
combination of no replication and pure availability is referred 
to as QMl and indirect replication and friend availability is 
referred to as 2M2. 



III. Theoretical analysis of availability 

Data availability has already been theoretically addressed ii 
the context of peer-to-peer storage |[TOl . Despite the similarit; 
of making data available in peer-to-peer networks and dis 
tributed replication of OSN user data, these problems differ oi 
some fundamental points such as the model of replica failures 

A. Model and assumptions 

Let G{V,E) be the graph modeling the OSN friendshi] 
connectivity. We assume a discrete time model, and tha 
G does not vary over time. Let OC,(f) be the probabilit; 
that node ; is online at time t. We assume that all node 
have the same session characteristics: Vf,V/ G V, a,(f) = a{t) 
The graph dynamics is captured by nodes being offline o 
online. We also assume that the probability of two node 
being connected (online) at a given time t are independent of 
time: P{"i and j are online at time t") = ai{t)aj{t) = a{t)^ 
In other words connections are independent and identic all; 
distributed. In the following, a denotes the complement of a 

Most of the related work focuses on achieving a highl; 
reliable data storage system despite failures or departures o 
the peers holding the data. For instance in fTTl . a dynamii 
model is studied where peers continuously fail, and the repli 
cation rate must be adapted to compensate for these losses 
Our model on the other hand considers that friends are reliable 
they never fail nor depart from the system. Once the user dat; 
has been replicated to all its friends, the system is converged 
The main aspect of our model is to consider that nodes cai 
be temporarily disconnected by being dynamically online an( 
offline. 

7 ) Data availability metric: The goal of P2P storage sys 
tems is to store large amounts of data in a reliable fashion, a 
low cost. Considerations such as the available disk space of a 
replica or the available bandwidth are therefore important. For 
instance, some of the works on P2P systems consider the use 
of error correcting codes as a mean to reduce the consumed 
space while keeping the desired availability. In this work, we 
assume that enough storage is available at every node. 

In this paper, we are interested in the user's data availability: 
during which fraction of time is user ;'s data available? 
Formally, let A, be this availability, and Q. be the measuring 
period: 

Ai = - — - ^ P{i's data is available at time k). 
1^1 ken 

Given that replication is made only among directly con- 
nected neighbors, the problem we address is a local one with 
respect to G. As we will see, the most relevant graph property 
in our context is the out-degree of nodes, i. e., the number 
of potential replicas. Other graph properties such as diameter, 
connectedness or clustering do not have a major impact on 
availability. 

2) Converged case: We consider that a given node ; is in a 
converged state when all its neighbors have obtained a copy of 
f s data, and are therefore able to act as /'s data replica. This 
constitutes an upper bound of the availability A, since / cannot 
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Fig. 2. Match between simulations and analysis 
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Fig. 3. Impact of the degree on average pure availability (analysis). 

expect more replicas. Let « be the number of neighbors of node 
;. With the assumption of independence between online times, 
we have 

A; < P{i's replicas are all offline) = 1 - (1 - a(f))"+^ 

This is the maximal availability a node with n neighbors 
can hope for in our model. 

3) Non-converged case: Fix node /. Let A^, (f) be the random 
variable representing the number of /'s replica in the system 
at time t (be they online or not). We have P{Ni{Q) = 1) = 1, 
since the only available copy of /'s data at time is on node 
/. Note that A^,(r) is strictly increasing: no replica ever deletes 
a hosted user's data. Now, assume there are k — j replicas in 
the system at time f — 1. Then the probability that Ni{t) = k is 
the probabiUty that exactly j nodes became replicas at time t. 
In other words, using total probability law: 

P{N,{t) =k)= L'=oP{N,{t -\) = k-j)- 

P{X,^j\N,{t-l)=k-J). (1) 

Where X, is the random variable representing the number of 
replications happening at time f . Let us express P{Xt = j\Ni{t — 



1) =k — j), the probability that exactly j new replicas appear 
at time t. Note that we need to differentiate the case where 
no new replica is created at this time (j ~ 0) for the general 
case. Assume j = 0, two reasons for this: either no replica 
connected, or at least one replica was online, but no non- 
replica friend connected. Thus it writes: 

P{X,^Q\Ni{t-l)=k) 
= (1 - ait))" + (1 - a{t)r-%^, C-^)a{tni a{t) f-P 
(1 - a{t)f + (1 - a(f - (1 - a{t)f) 

Now assume j > 0, at least one replica is created, which also 
means that at least one replica is online, and exactly j friends 
that are still not replicas connect. In the following, using total 
probability law, we decompose according to the number p of 
replicas that are online at time t, and use the fact that if at 
least one replica is online, the number of new replicas created 
is exactly the number of non-replica friends that are online: 

PiX,=j\N,it^l)^k-j) = 
Y-tlJiPO^t = j and p online replicas |A^,(r — 1) = k — j) 



TABLE I 

Summary over relationship graphs 



= Lp={ ^( P online replicasjA^; (r — 1 ) = k — j). 

P{Xt = j\ p online replicas and Ni{t — 1) = k — j) 
= t/)aitni a(r))*-^'-".("-f ^■)a(f).'(l - ait))-' 

(1 - (1 - ait)f-j){"-)+j)aity{l - ait))"-'' 

Since we know that P{N{0) = 1) = 1 we can then iteratively 
compute P{N{t) = k). 

Now that we can describe the evolution of ;'s number of 
replicas over time, we can express the probability that ;'s data 
is available at time f: it is the complement of finding all the 
available replicas offline. Let A{t) be the event "/'s data is 
available at time f". We have 

P{A{t)) = f m{t) = m - (1 - ait))'). (2) 

Using this relation, it is possible to numerically estimate the 
average availability of a given node over time with Monte- 
Carlo simulation (code is available lfT2l ). Figure |2] compares 
the analytically derived average availability over a day with 
the results of the simulation, for different online patterns. We 
consider here nodes with a degree d = 22, and the probability 
of a node to be online in a given time bin is drawn uniformly 
at random over the day (a(f) — a). The theoretical maximum 
availability, i. e., the availability achieved in the converged 
case, is also plotted. The gap between the maximal availability 
and both analysis and simulations of convergence decreases as 
the average online time increases. This confirms the intuition 
that the more often nodes are online, the faster the converged 
situation is reached, i. e., all potential replicas have a copy of 
the data. 

Figure |3] presents the average availability of nodes as a 
function of their average online time for various node degrees. 
The point {x = 0.2,^ = 0.65) on the lower (blue) curve shows 
that according to our analysis, nodes with 6 neighbors being 
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online with 20% probability in each time bin have on average 
65% availability. This illustrates the strong impact of node 
degree on availability: nodes with 30 neighbors need to be 
online nearly 3 times less to achieve the same availability, 
and can expect more than 97% availability with 20% online 
time. 

IV. Simulation methodology 

We evaluate DOSN content availability for different graphs 
and user online patterns. In this section, we describe the 
friendship graphs used, explain how we model the session 
characteristics of DOSN users, and present the simulation 
procedure. 

A. Social Graphs and their Node Degree Distribution 

As shown in Section |IIll the availability of content depends 
strongly on the social graph, especially its node degree dis- 
tribution. For this study we use relationship graphs from two 
well-known OSNs, as well as synthetically generated graphs. 
The synthetic graphs allow us to explore social graphs with 
different average node degrees as well as the influence of graph 
size. Table HI summarizes the relationship graphs we use in this 
study. 

Regarding the real-world graphs, we use graphs derived 
from crawls by Fritsch Us) for the G-StudiVZ graph and 
Cha et al. HH for the G-TwiTTER graph. StudiVZ is a popular 
Facebook like OSN for German speaking students. While 
G-StudiVZ is symmetric due to reciprocal friendships, G- 
TwiTTER is asymmetric. Being a follower is a unidirectional 
relation. 

The Twitter dataset includes a link from user A to user 

B, when the crawl revealed that A follows B. This notion 
of edge direction is contrary to ours. We consider edges to 
point in the direction in which content is transferred: From the 
content originator to the friend that is interested in the data. 
In Twitter content flows from the followed (B) to the follower 
(A). Therefore, when we want to use the Twitter dataset we 
need to consider the distribution of in-degrees, instead of out- 
degrees. 

Both of these graphs have more than a million nodes, G- 
TwiTTER has 50 million. Running simulations with that many 
nodes is not feasible when exploring all our combinations of 
simulation parameters. In order to reduce the computation time 
we synthetically generate smaller graphs (e. g., 100,000 nodes) 



TABLE II 

Summary over session characteristics 



Name 






Durations 




AiTivals/bin 


Sessions 
per day 




Qi 


median 


mean Q3 


fitted Model 


low high 


P-Facebook 
P-Radius w/o always-on 


Om37s 
BmlOs 


6m30s 
5ml6s 


69m 52m 
50m 19m 


WeibuU (0.4; 1284) [s] 
WeibuU (0.35; 550) [s] 


-97 % +107 % 
-52 % +44 % 


2.5 
4.5 







+ G-StudiVZ 






X G-SynStudiVZ-100k 






A G-Twitter 






o G-SynTwitter-1 00k 






G-Regular-100k-22 






Ha 





1e+00 1e+02 1e+04 1e+06 



X = Node Degree 

Fig. 4. CCDF of node degree distributions of the graphs used. Axis are 
cutoff at 100.000 nodes for increase comparability with the syntetic graphs. 

that follow the same node degree distribution as the real-world 
graphs, using the gengraph tool by Viger and Latapy ||T5| . 

To produce the input degree distributions for gengraph, we 
fit the G-StudiVZ degrees with a Weibull distribution with 
shape 0.9 and scale 22.5. For Twitter's in-degrees we fit a 
power-law distribution with a ~ 2.25 and a mean of 41, using 
the node degree distribution generator distrib, which is part 
of gengraph. We validate our fitting through Kolmogorov- 
Smirnov (KS) tests as well as visual inspection (Figure |4]i. 
For comparison, we also generate regular graphs, i. e., graphs 
where each node has the same degree, with gengraph. We 
use these graphs to study the impact of different (average) 
node degrees and in contrast to the heavy-tailed real-world 
distributions. 

In our simulations we do not consider that the underlying 
social graphs are changing. However, as users go online and 
offline the graph of active users is highly dynamic. We char- 
acterize and model these changes in the following subsection. 

B. Session characteristics 

For our time-based simulations, it is important to reproduce 
the session characteristics of DOSN users. Although we were 
able to get access to relationship graphs from real-world OSNs, 
there are no publicly available models of session durations 
and arrivals for those graphs. Therefore, we rely on two 
types of session characteristics derived from data gathered 
in the aggregation network of a large European ISP. First, 
we consider session characteristics of Facebook sessions ob- 
served from about 2,000 users (P-Facebook), see Schneider 
et al. 1161 for details on the dataset and a study of popular 



user interactions with OSNs. Second, we consider DSL session 
characteristics (P-Radius) from about 20,000 customers, see 
Maier et al. |17| for details on the dataset. 

From our DSL session data we observed that a significant 
fraction (57 %) of lines are always connected, i. e., they 
perform an automatic reconnect upon disconnects from the 
ISP. The exact fraction heavily depends on the (default) 
configuration of the DSL router, and is subject to change 
depending on the set of services provided by the ISP, e.g., 
a VoIP customer DSL line should be always connected. We 
choose to filter out those sessions and instead use a parameter 
in the simulations denoting the fraction of always-on nodes. 

Because the number of users for which we observed the 
session characteristics does not match with the sizes of our 
social graphs, we need to determine a session model that can 
scale with the number of users. For each session we identify 
the login and logout timestamps and model the session start 
times and durations. Our session start time model accounts for 
time of day effects. We model the arrivals using a modulated 
Poisson process with 20-minute bins. In each bin the rate is 
modulated according to the observed probability that a session 
is starting in that bin. 

We did not observe a strong correlation between the ses- 
sion's start time (of day) and its duration. Pearson's correlation 
coefficient is -0.28 and -0.06 for P-Facebook and P- 
Radius, respectively. Therefore, we model the session dura- 
tions independently from the session start times, using Weibull 
distributions. Again we validate the fittings using KS tests. 

Table summarizes the session characteristics observed. 
We show the mean durations as well as the median, together 
with the fitted distributions. Note that for P-RADIUS the 
statistics and the model only correspond to those DSL lines 
which are not permanently connected (referred to as "w/o 
always-on"). In addition, we show the deviation from the 
average number of arrivals per bin and number of sessions 
per user per day. 

C. Simulation setup 

Our simulation takes three parameters: (i) the social graph, 
(ii) the session characteristics, and a configuration that 
determines the duration of the experiment and the time period 
considered to compute the availability metrics. 

First we select a social graph which in turn defines A^, the 
number of nodes/users in the DOSN, and the node degree 
distribution as described in Section IIV-AI 

Second, we select one of the two session models from 
Section ITV-BI and generate a (possibly empty) set of presence 
times for each of the nodes of the graph. A presence time 
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Fig. 5. Comparing G-StudiVZ and synthetic graphs of varying size, using 
P-Facebook and 2M1: Using small synthetic graphs is viable. 
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Fig. 6. Major gains in availability' using replication schemes (2,1) over no 
replication (0) for P-FACEBOOK. 1 is only marginally worse than 2. 



consists of the start and end time of an online session. In 
addition to the session model, the online times of users depend 
on the fraction P of nodes that are always connected. Note that 
the session models include time-of-day effects, that is nodes 
are online and offline depending on the time of day. Further 
the process used to generate the sessions randomly applies 
different sessions to all users, so that two users do not have 
the same on/off-pattern. 

We generate a total of {1 — P) x N x presence times, 
where is the average number of sessions per day and 
user (see Table This allows to compare the results for 
graphs with different numbers of nodes. The session start 
times follow the time-of-day modulated bins. The session end 
times are produced by adding a duration drawn from the 
Weibull distribution corresponding to the session model. To 
keep simulation runs with the same session model comparable, 
we generate one big set of session durations per model and use 
them subsequently. Each session lasts for at least 5 seconds. 

For the remaining P xN nodes, we set the session start time 
to the beginning of the simulation and the session end time to 
the end of the simulation time. Once everything is prepared we 
compute the availability of each user's content for the different 
replication schemes and availability metrics in one pass. 



V. Simulation results 

In this section, we start by demonstrating that our reduced 
size graphs accurately capture availability. Then, we study 
the availability that can be expected from our socially-driven 
replication schemes, and highlight the importance of the way 
the availability metric is defined. We also discuss the impact of 
different social graphs and user session models. We close this 
section by assessing the improvements in availability gained 
from a fraction of always-online nodes as well as considering 
longer simulation periods. 



A. Graph sizes do not matter 

In the remainder of this paper, we present results as cumu- 
lative distribution function (CDF) of the content availability 
(y-axis) as a function of the fraction of nodes/users (x-axis). 

Figure|5] shows the availability of G-StudiVZ based graphs 
with a P-Facebook session model over 24 h simulation 
time for 2M1. Before computing the CDFs, we remove all 
nodes from the simulation output which never go online, and 
therefore never generate any content. For all simulations this 
exclusion affects around 6 % of the nodes for P-Facebook 
and 1 % for P-Radius. 

Since simulating the replication process is an expensive task 
in terms of computational time, we do most of the experiments 
on scaled-down graphs. In Section IIV-AI we already showed 
the match between synthetic graphs and real-world graphs 
in terms of node degree distribution. In Figure |5] we now 
show that also the resulting availability is very similar, cf. G- 
StudiVZ and G-SynStudiVZ-1M. Moreover, scahng down 
the size of the network does not affect the availability distribu- 
tion either, cf. G-SynStudiVZ-1k, G-SynStudiVZ-10k 
and G-SynStudiVZ-100k. 

Based on this insight, we use the synthetic graphs with 
100,000 nodes for all further simulations. Using smaller 
graphs significantly speeds up simulation time and thereby 
allows to explore a broader variety and combination of pa- 
rameters. Unless otherwise mentioned we present results for 
G-SynStudiVZ-100k and P-Facebook for 24 h, which 
serves as our baseline scenario. 

B. The Gain of Replication 

Next, we turn to understanding the impact of the different 
replication schemes that we introduced in Section III-BI Fig- 
ure|6]shows 0M7, \M1 and 2M1 for our baseline scenario. As 
compared to QMl which represents the availability of the nodes 
themselves (no replication), we observe a drastic increase in 
availability for both replication schemes 1 and 2. While 2 
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vs. total simulation time (Ml): M2 yields higher availability. 
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Fig. 8. Impact of graph type on availability, using P-Facebook 



offers limited gains in availability for the bottom and top 5 % 
of nodes ranked by 0, the vast majority (90 %) of nodes double 
their availability and the median availability is shifted from 
less than 5% to more that 35% — an increase by a factor of 7. 
The bottom 5 % of nodes suffer from the reduced opportunities 
to hand over their data, due to their own limited online time. 
The top 5 % do not gain much as their own online time is 
already high. 

Taking a detailed look at Figure |6] we see that replication 
schemes R2 and Rl produce very similar results. As expected, 
2 has better availability compared to 1. However, the similar 
results indicate that passing along content over multiple friends 
does not happen often. One explanation for this is that contacts 
with the data owner occur before contacts between two friends. 
Indeed, increasing the simulation time to multiple days did not 
change this result. 

Note that due to the small difference between 1 and 2 and 
given that the restriction of allowing to copy the data only from 
the originator is artificial, e. g., when relying on encryption 
of content, we refrain from presenting the results for 1 in the 
sequel. Further, always produces the same curve for identical 
presence times, which only varies for different graph sizes and 
session characteristics. 

C. Difference between availability metrics 

So far, we looked at the availability measured in relation 
to the total time of the experiment, which we defined in 
Section Hn as our first metric Ml. Our second metric M2 is the 
data availability measured in relation to the online time of a 
user's friends. In FigureQwe observe that 2M2 is a significant 
improvement over 2M1 for the baseline scenario. The median 
availability increases to around 60 %. This corresponds to a 
30 % gain for more than 60 % of the nodes over 2M1. 

However, the availability of about 8 % of nodes reduces 
to zero for 2M2, cf. bottom left part of plot. Indeed, if no 
friend is online at the same time as the node itself, M2 reports 
while Ml reports the node's own availability. Our finding 



from the previous subsection holds, in that 2 is slightly better 
than 1 when comparing IM2 with 2M2 (not shown). Without 
replication, M2 is still better than Ml but by very little. 
Therefore, from now on we focus on QMl (as reference), as 
well as 2M1 and 2M2 whenever useful. 

D. Impact of graphs types and node degree 

As introduced in Section IIV-AI we use different types of 
graphs with different node degree distributions. In Figure [H] 
we present the availability of two graphs in addition to our G- 
SynStudiVZ-IOOk baseline graph: G-SynTwitter-100k 
and G-Regular-IOOk-22. We choose the average node 
degree of G-StudiVZ (22) as node degree for the regular 
graphs, to allow comparison. 

G-Regular-IOOk-22 achieves the best availabihty, fol- 
lowed by G-SynStudiVZ-100k. G-SynTwitter-100k 
has the worst availability, although it has the highest average 
node degree. From this observation, we conclude that the 
degree distribution is more important than the average degree 
for availability. Both synthetic graphs have a lot of nodes with 
only a few friends, so that their opportunities to get their 
data repHcated is lower. Indeed, G-SynTwitter-100k has 
significantly more nodes with single digit node degrees than 
G-SynStudiVZ-100k (50 % over 35 %). The G-Regular- 
IOOk-22 graph does not have nodes with zero 2M2 availabil- 
ity, as nodes do not need to rely on the opportunity to meet 
their one and only friend. 

To study the impact of a graph's node degree, we consider 
different versions of G-Regular-IOOk-?. Besides the degree 
of 22 we also used G-Regular- 1 00k- 1 0, and G-Regular- 
IOOk-42. Figure |9] shows that the degree of each individual 
node has an impact on the availability, although it is limited, 
as can be observed through the saturation effect around 22 
neighbors. While 2M1 is always better for higher node degrees, 
2M2 is slightly worse for the top half of nodes (right sides of 
plot). 
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Fig. 10. Impact of session ciiaracteristics on availability, using G- 
SynStudiVZ-IOOk 



E. Impact of session characteristics 

As described in Section IIV-BI we use mainly two different 
distributions for the session characteristics, P-Radius and P- 
Facebook. Figure [TO] compares those on G-StudiVZ. P- 
Radius shows higher availabihty for 2M1 and 2M2. Yet, for 
0M7 the difference is limited. However, the gap between the 
two session models is certainly bigger than the gap between 
different graphs, highlighting the higher importance of the 
session models compared to the graph. 

The difference in availability for the replication cases, 
corresponds to the expected total online time of all nodes. 
As shown in Table [III the product of session per day and 
average duration is higher for P-Radius. This matches the 
expectation that a user usually is more often and longer online 
in the Internet than logged in to Facebook. Obviously, a user 
first needs to establish an Internet access connection before 
using Facebook. 

The reason why the higher total online time of P-Radius is 



not as pronounced in the QMl curve is that the median session 
durations are tiny compared to the simulation duration, e.g., 
6m30s corresponds to a value of 0.0045 on the y-axis. 

F. Adding always on nodes 

Many DOSNs rely on external storage and consider storage 
as always available. Similarly, we saw more than half (57 %) of 
all P-Radius sessions being connected all the time. Therefore 
we now explore how much we can improve availability by 
replacing a fraction P of nodes with always-on nodes. This can 
be achieved by running the DOSN client on a high availability 
platform, e.g., their home gateway which is usually always 
online, instead of their computer. 

Figure [TT] (legend on top) shows our baseline scenario for 
P~ {0,5, 10,25,50,75}%. Note, the compressed curves due to 
this fraction of always available nodes. Replacing only 5 % of 
the nodes, already increases the 2M1 availability of the 10% 
top most available nodes to almost full availability. 

G. Considering longer time-frames 

If in addition to always-on nodes, we consider a longer 
time frame until the data is requested, more contacts will 
have occurred and the content is spread to more replicas. 
Considering a time frame of a full week instead of 24 h, friends 
(2M2) have almost a 100% guarantee to obtain the desired 
data, as shown in Figure [121 In the most extreme case, looking 
at a 7 day experiment with 75 % always on nodes, each nodes 
availability is at least 80 %. 

Observing such high availability we are interested in de- 
termining the convergence towards this full weeks results. 
Therefore, in Figure [T3] we plot the 2M2 availability of 
DOSN contents for each day of the week. We observe steady 
but decreasing improvement on each day. Culminating in an 
almost vertical curve, at the last day of the week more than 
80 % of all the nodes have full availability. 
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VI. Related work 

Two major strategies have been studied for data replication 
in distributed systems: replicate the whole file or replicate the 
pieces of a file. Works such as ifTSl . |[T9l are examples of 
the first strategy. PAST HSl is a P2P file system built on top 
of Pastry ll20l . a DHT-based system. 1191 explores different 
variations of the gain of replicating data to randomly chosen 
hosts on data availability. [,191 shows that implicit replication 
that follows file popularity, as in Gnutella and BitTorrent, 
is not sufficient. The second replication approach splits a 
file into a pre-defined number of pieces and distributes them 
across the nodes. Examples of this approach include e.g., 
II2TI . or II22I . 1231 extensively covers the related work in this 
area. Replicating the whole file and pieces of a file can be 
combined, as done in OceanStore Q that uses a combination 
of different replication strategies for different data types, e. g., 
erasure codes for archiving. ifTOI compares the two replication 
strategies and finds that erasure codes lead to better availability 
than redundancy. 



Availability has been studied in the context of P2P networks. 
1241 explores the properties of different coding schemes and 
finds that schemes which give better availability are in general 
the ones that have higher maintenance cost. Q measures and 
analyzes host availability in a large structured P2P file sharing 
network. They find that host availabihty is dependent on the 
time of day, but not on other hosts. 125)1 confirmed the results 
of Q in Gnutella and Napster. l26]| also explore Gnutella and 
Napster but focus on the nature of end-hosts, and find they 
are very heterogeneous, e. g., regarding access bandwidth and 
their limited cooperation. 

DOSNs have gained popularity in the research community 
in the recent years, aimed at giving back users control over 
their data. Most DOSN projects concentrate on the question of 
access rights management and encryption. They rely on servers 
or other external services to guarantee data availability. Others 
use social links for trust but do not exploit it for DOSN data 
availability. Tribler 16] is a file sharing system built on top of a 
social overlay. It uses a complex replication system to ensure 
the best possible availability of data. Persona [2| relies on 
external storage services that can be anything from a server to 
an Amazon cloud account per user. Persona enables privacy by 
introducing a fine grained system that users can use to manage 
access to their data. Diaspora |3| recently obtained much 
attention from the press. From the users viewpoint. Diaspora 
appears to be a common server-based OSN. However, data can 
be encrypted and everyone can set up a server for Diaspora, 
so that availability is ensured by many distributed servers. 
SuperNova l27l is an architecture for a DOSN that solves 
the availability issue by relying on super-peers that provide 
highly available storage. 

VII. Conclusion 

In this paper, we proposed to exploit on the structure of the 
social graphs in DOSNs, to replicate users' content. As the 
friends of a given user are interested in his content anyway, 
they can be used to provide replicas of his content. Through 
theoretical analysis as well as simulations based on data from 
OSN users, we study this natural replication scheme. We find 
that with such a replication scheme, the availability of the users 
content increases drastically, when compared to the online time 
of the users, e. g., by a factor of more than 2 for 90 % of the 
users. Furthermore, adding a small fraction of always online 
users to our scheme leads to high overall availability of users' 
content. 

As future work, we want to study the speed at which user 
content updates propagate through the graph of DOSNs, as 
well as compare our replication scheme with others that are 
not purely based on the friendship graph. 

We also need to leave studies to future work that look at 
the overhead in terms of traffic, time, messaging generated by 
a system similar to the presented one. Such a system would 
need to be implemented, deployed and observed while in use 
to find out about those details. 
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